Python is a programming language, and python is also the name of the program that runs scripts written in that language. You can run a script using ipython with something like ipython my_script.py, or using python with something like python my_script.py. Prefer ipython over python. This is because ipython has a bunch of features like tab completion, inline help, and easy access to shell commands which are just plain great (more on these in a bit).
To run a cell, press shift + enter
while it's selected (or click the play button toward the top of the screen).
In the shortcuts below, + means 'press at the same time' and , means 'press after'.
Enter -- Start editing this cell
Esc -- Stop editing this cell
Option + Enter -- Run this cell and make a new cell after it (Note: this is OS X specific. Check help >> keyboard shortcuts to find your operating system's version)
Shift + Enter -- Run this cell and move on to the next one
Up Arrow and Down Arrow -- Navigate between cells (must be in command mode)
Esc, m, Enter -- Convert the current cell to markdown and start editing it again
Esc, y, Enter -- Convert the current cell to a code cell and start editing it again
Esc, d, d -- Delete the current cell
Esc, a -- Create a new cell above the current one
Esc, b -- Create a new cell below the current one
Command + / -- Toggle comments in Python code (OS X)
Ctrl + / -- Toggle comments in Python code (Linux / Windows)
In [1]:
# you don't have to rename numpy to np but it's customary to do so
import numpy as np
# you can create a 1-d array with a list of numbers
a = np.array([1, 4, 6])
print 'a:'
print a
print 'a.shape:', a.shape
print
# you can create a 2-d array with a list of lists of numbers
b = np.array([[6, 7], [3, 1], [4, 0]])
print 'b:'
print b
print 'b.shape:', b.shape
print
In [2]:
# you can create an array of ones
print 'np.ones((3, 4)):'
print np.ones((3, 4))
print
# you can create an array of zeros
print 'np.zeros((2, 5)):'
print np.zeros((2, 5))
print
# you can create an array from a range of numbers and reshape it
print 'np.arange(6):'
print np.arange(6)
print
print 'np.arange(6).reshape(2, 3):'
print np.arange(6).reshape(2, 3)
print
# you can take the transpose of a matrix with .transpose or .T
print 'b and b.T:'
print b
print
print b.T
print
In [3]:
# you can iterate over rows
for i, this_row in enumerate(b):
    print 'row', i, ': ', this_row
print
# you can access sections of an array with slices
print 'first two rows of the first column of b:'
print b[:2, 0]
print
In [4]:
# you can concatenate arrays in various ways:
print 'np.hstack([b, b]):'
print np.hstack([b, b])
print
print 'np.vstack([b, b]):'
print np.vstack([b, b])
print
In [5]:
# note that you get an error if you pass in the arrays as separate
# arguments instead of as a single list
print 'np.hstack(b, b):'
print np.hstack(b, b)
print
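To look at this failure without stopping the notebook, the call can be wrapped in a try/except. This is a small sketch rather than part of the original walkthrough: b is redefined so the block runs on its own, the names stacked and raised are made up for illustration, and print is called with a single argument so it behaves the same under Python 2 and Python 3.

```python
import numpy as np

b = np.array([[6, 7], [3, 1], [4, 0]])

# np.hstack expects ONE sequence of arrays, so the arrays go inside a list
stacked = np.hstack([b, b])
print(stacked.shape)

# passing the arrays as separate positional arguments raises a TypeError
try:
    np.hstack(b, b)
    raised = False
except TypeError as err:
    raised = True
    print('TypeError: ' + str(err))
```

The exact wording of the error message depends on the installed NumPy version, so only the exception type is relied on here.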
In [6]:
# you can perform matrix multiplication with np.dot()
c = np.dot(a, b)
print 'c = np.dot(a, b):'
print c
print
# if a is already a numpy array, then you can also use this chained
# matrix multiplication notation. use whichever looks cleaner in
# context
print 'a.dot(b):'
print a.dot(b)
print
# you can perform element-wise multiplication with *
d = b * b
print 'd = b * b:'
print d
print
a.dot(b)
Out[6]:
In addition to arrays, which can have any number of dimensions, NumPy also has a matrix data type which always has exactly 2. DO NOT USE matrix.
The original intention behind this data type was to make NumPy feel a bit more like Matlab, mainly by making the * operator perform matrix multiplication so you don't have to use np.dot. But matrix isn't as well developed by the NumPy people as array is. matrix is slower and using it will sometimes throw errors in other people's code because everyone expects you to use array.
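The difference in how * behaves is easiest to see side by side. The sketch below uses a made-up 2x2 array m and illustrative variable names; it is only meant to show why mixing the two types is confusing, not to suggest using matrix.

```python
import numpy as np

m = np.array([[1, 2], [3, 4]])

# with plain arrays, * is element-wise and np.dot is the matrix product
elementwise = m * m
matrix_product = np.dot(m, m)

# with np.matrix, the same * operator performs the matrix product instead
as_matrix = np.matrix(m)
matrix_star = as_matrix * as_matrix

print(elementwise)
print(matrix_product)
print(np.array_equal(np.asarray(matrix_star), matrix_product))
```

Code that receives a matrix where it expected an array will therefore compute something different from what its author intended, which is one way the errors mentioned above come about.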
In [7]:
# you can convert a 1-d array to a 2-d array with np.newaxis
print 'a:'
print a
print 'a.shape:', a.shape
print
print 'a[np.newaxis] is a 2-d row vector:'
print a[np.newaxis]
print 'a[np.newaxis].shape:', a[np.newaxis].shape
print
print 'a[np.newaxis].T is a 2-d column vector:'
print a[np.newaxis].T
print 'a[np.newaxis].T.shape:', a[np.newaxis].T.shape
print
In [8]:
# numpy provides a ton of other functions for working with matrices
m = np.array([[1, 2],[3, 4]])
m_inverse = np.linalg.inv(m)
print 'inverse of [[1, 2],[3, 4]]:'
print m_inverse
print
print 'm.dot(m_inverse):'
print m.dot(m_inverse)
In [9]:
# and for doing all kinds of sciency type stuff. like generating random numbers:
np.random.seed(5678)
n = np.random.randn(3, 4)
print 'a matrix with random entries drawn from a Normal(0, 1) distribution:'
print n
Add a column of ones as the first column of the matrix X_no_constant. This is a common task in linear regression and general linear modeling and something that you'll have to be able to do later today.
Multiply the resulting matrix by the betas vector below to make a vector called y.
Specifically, given a matrix:
\begin{equation*}
\qquad \mathbf{X_{NoConstant}} = \left( \begin{array}{cccc}
x_{1,1} & x_{1,2} & \dots & x_{1,D} \\
x_{2,1} & x_{2,2} & \dots & x_{2,D} \\
\vdots & \vdots & \ddots & \vdots \\
x_{i,1} & x_{i,2} & \dots & x_{i,D} \\
\vdots & \vdots & \ddots & \vdots \\
x_{N,1} & x_{N,2} & \dots & x_{N,D}
\end{array} \right) \qquad
\end{equation*}
We want to convert it to:
\begin{equation*}
\qquad \mathbf{X} = \left( \begin{array}{ccccc}
1 & x_{1,1} & x_{1,2} & \dots & x_{1,D} \\
1 & x_{2,1} & x_{2,2} & \dots & x_{2,D} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{i,1} & x_{i,2} & \dots & x_{i,D} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{N,1} & x_{N,2} & \dots & x_{N,D}
\end{array} \right) \qquad
\end{equation*}
So that if we have a vector of regression coefficients like this:
\begin{equation*}
\qquad \beta = \left( \begin{array}{c}
\beta_0 \\ \beta_1 \\ \vdots \\ \beta_j \\ \vdots \\ \beta_D
\end{array} \right)
\end{equation*}
We can do this:
\begin{equation*}
\mathbf{y} \equiv \mathbf{X} \beta
\end{equation*}
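A tiny worked case may make the shapes concrete. The numbers below are made up for illustration (N = 2 data points, D = 2 dimensions), and the column of ones is built with np.ones((n_data, 1)), which is one of several ways to do it.

```python
import numpy as np

# a tiny made-up matrix: N = 2 data points, D = 2 dimensions
X_no_constant = np.array([[1.0, 2.0],
                          [3.0, 4.0]])
# made-up coefficients: beta_0 (the intercept), beta_1, beta_2
betas = np.array([0.5, 1.0, -1.0])

# prepend a column of ones so that beta_0 is added to every row
n_data = X_no_constant.shape[0]
X = np.hstack([np.ones((n_data, 1)), X_no_constant])

# y_i = beta_0 * 1 + beta_1 * x_{i,1} + beta_2 * x_{i,2}
y = np.dot(X, betas)
print(X)
print(y)
```

The first row gives 0.5 + 1.0 * 1.0 - 1.0 * 2.0 = -0.5, and the second gives 0.5 + 1.0 * 3.0 - 1.0 * 4.0 = -0.5, so y has one entry per data point.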
In [14]:
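# n_data is defined here so that this cell runs on its own; the next cell sets it again
n_data = 10
a = np.ones(n_data)[np.newaxis].T
a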
Out[14]:
In [16]:
np.random.seed(3333)
n_data = 10 # number of data points. i.e. N
n_dim = 5 # number of dimensions of each datapoint. i.e. D
betas = np.random.randn(n_dim + 1)
X_no_constant = np.random.randn(n_data, n_dim)
print 'X_no_constant:'
print X_no_constant
print
# INSERT YOUR CODE HERE!
X = np.hstack([np.ones(n_data)[np.newaxis].T, X_no_constant])
y = np.dot(X, betas)
# Tests:
y_expected = np.array([-0.41518357, -9.34696153, 5.08980544,
                       -0.26983873, -1.47667864, 1.96580794,
                       6.87009791, -2.07784135, -0.7726816,
                       -2.74954984])
np.testing.assert_allclose(y, y_expected)
print '****** Tests passed! ******'
In [19]:
# like with numpy, you don't have to rename pandas to pd, but it's customary to do so
import pandas as pd
b = np.array([[6, 7], [3, 1], [4, 0]])
df = pd.DataFrame(data=b, columns=['Weight', 'Height'])
print 'b:'
print b
print
print 'DataFrame version of b:'
print df
print
In [20]:
# Pandas can save and load CSV files.
# Python can do this too, but with Pandas, you get a DataFrame
# at the end which understands things like column headings
baseball = pd.read_csv('data/baseball.dat.txt')
# A DataFrame's .head() method shows its first 5 rows
baseball.head()
Out[20]:
In [22]:
# you can see all the column names
print 'baseball.keys():'
print baseball.keys()
print
# print 'baseball.Salary:'
# print baseball.Salary
# print
# print "baseball['Salary']:"
# print baseball['Salary']
In [23]:
baseball.info()
In [24]:
baseball.describe()
Out[24]:
In [26]:
# baseball
In [34]:
# You can perform queries on your data frame.
# This statement gives you a True/False vector telling you
# whether the player in each row has a salary over $1 Million
# (the threshold of 1000 assumes Salary is recorded in thousands of dollars)
millionaire_indices = baseball['Salary'] > 1000
# print millionaire_indices
In [28]:
# you can use the query indices to look at a subset of your original dataframe
print 'baseball.shape:', baseball.shape
print "baseball[millionaire_indices].shape:", baseball[millionaire_indices].shape
In [33]:
# you can look at a subset of rows and columns at the same time
print "baseball[millionaire_indices][['Salary', 'AVG', 'Runs', 'Name']]:"
baseball[millionaire_indices][['Salary', 'AVG', 'Runs', 'Name']].head()
Out[33]:
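Row queries can also be combined, and the rows and columns can be picked in a single step with .loc. The sketch below uses a small made-up DataFrame (the names and numbers are hypothetical and are not taken from the baseball data) so that it runs without the data file.

```python
import pandas as pd

# made-up data; the names and numbers are for illustration only
players = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Cy', 'Dee'],
    'Salary': [300, 1500, 2500, 900],
    'Runs': [40, 75, 20, 88],
})

# each comparison gives a True/False Series; combine them with & (and)
# or | (or), wrapping each comparison in parentheses
rich = players['Salary'] > 1000
rich_and_productive = (players['Salary'] > 1000) & (players['Runs'] > 50)

# .loc takes the row mask and the list of columns in one indexing step
subset = players.loc[rich_and_productive, ['Name', 'Salary']]
print(subset)
```

Indexing once with .loc selects the same rows and columns as indexing twice in a row, and it avoids building the intermediate DataFrame.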
In [30]:
# load shoe size data
shoe_size_df = pd.read_csv('data/baseball2.dat.txt')
shoe_size_df
Out[30]:
In [31]:
# an inner merge (the default) keeps only the players whose Name
# appears in both DataFrames
merged = pd.merge(baseball, shoe_size_df, on=['Name'])
merged
Out[31]:
In [32]:
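# an outer merge keeps every player from either DataFrame and fills
# the values that are missing on one side with NaN
merged_outer = pd.merge(baseball, shoe_size_df, on=['Name'], how='outer')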
merged_outer.head()
Out[32]:
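The difference between the two merges is easier to see on tables small enough to read in full. The two DataFrames below are made up for illustration (hypothetical names, salaries, and shoe sizes), with only one name in common.

```python
import pandas as pd

# made-up tables for illustration; only 'Bob' appears in both
salaries = pd.DataFrame({'Name': ['Ann', 'Bob'], 'Salary': [300, 1500]})
shoes = pd.DataFrame({'Name': ['Bob', 'Cy'], 'ShoeSize': [11.0, 9.5]})

# inner merge (the default): keep only the names found in both tables
inner = pd.merge(salaries, shoes, on=['Name'])

# outer merge: keep every name; values missing on one side become NaN
outer = pd.merge(salaries, shoes, on=['Name'], how='outer')

print(inner)
print(outer)
```

The inner result has a single row for Bob, while the outer result has three rows, with NaN for Ann's shoe size and for Cy's salary.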
Partner up with someone next to you. Then, on one of your computers:
Add a column of ones as the first column of the DataFrame X_df below. Name the new column 'const'.
Multiply X_df by the betas vector and assign the result to a new variable: y_new
Hint: This stackoverflow post may be useful: http://stackoverflow.com/questions/13148429/how-to-change-the-order-of-dataframe-columns
In [36]:
np.random.seed(3333)
n_data = 10 # number of data points. i.e. N
n_dim = 5 # number of dimensions of each datapoint. i.e. D
betas = np.random.randn(n_dim + 1)
X_df = pd.DataFrame(data=np.random.randn(n_data, n_dim))
# INSERT YOUR CODE HERE!
# insert the column of ones at position 0 so that it lines up with betas[0]
X_df.insert(0, 'const', np.ones(n_data))
y_new = np.dot(X_df, betas)
# Tests:
assert 'const' in X_df.keys(), 'The new column must be called "const"'
assert np.all(X_df.shape == (n_data, n_dim+1))
assert len(y_new) == n_data
print '****** Tests passed! ******'
In [37]:
X_df
Out[37]: